Biological Pattern Discovery with R: Machine Learning Approaches (Zheng Rong Yang)

er. The mapping function is shown below, where $f(\mathbf{X})$ is a

and $\mathbf{Y}$ is a label vector for the four countries:

$$\mathbf{Y} = f(\mathbf{X})$$

the k-mers were ranked using a supervised machine learning

examine which k-mers dominated the difference of the genomic

between countries.

h, how the genomic patterns evolved over time in these

was investigated. This requires a regression analysis model and

t is shown below, where $\mathbf{T}$ stands for the time lag between the

on date of each sequence and the date of the first occurrence of

$$\mathbf{T} = f(\mathbf{X})$$
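A minimal sketch of how such a time-lag target could be computed, assuming each sequence carries a date and that the reference point is the earliest date in the data set (the reference event is not fully stated here, so this choice is an assumption). The function name `time_lags` and the example dates are hypothetical and are not taken from the study.

```python
from datetime import date

def time_lags(dates):
    """Return the lag in days between each date and the earliest date.

    Assumption: the reference point is the earliest date in `dates`.
    """
    first = min(dates)
    return [(d - first).days for d in dates]

# Hypothetical dates, for illustration only.
example_dates = [date(2020, 3, 1), date(2020, 1, 15), date(2020, 6, 30)]
lags = time_lags(example_dates)
print(lags)
```

The resulting vector of lags would play the role of the regression target, with one entry per sequence.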

nomics distribution of sequences

A self-organising map (SOM) [Kohonen, 1982] model with 900 neurons

was constructed for the dual-normalised 3-mer data set for these 58,897

sequences from four countries. Figure 7.18 shows the SOM map generated

with the kohonen package. The map shows a clear pattern: the

sequences from the four countries were mapped to almost distinct areas.
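How well a map separates the countries can be summarised by counting occupied cells and cells that hold sequences from only one country. The sketch below is one hedged way to tally these, assuming the trained map yields a (neuron index, country label) pair for every sequence. The function name `occupancy_and_purity` and the toy data are hypothetical and are not part of the kohonen package.

```python
from collections import defaultdict

def occupancy_and_purity(assignments, n_neurons):
    """Summarise a SOM from (neuron_index, country_label) pairs.

    Returns (occupied, pure, occupancy, purity), where a neuron is
    'pure' if all sequences mapped to it come from one country.
    """
    labels_per_neuron = defaultdict(set)
    for neuron, country in assignments:
        labels_per_neuron[neuron].add(country)
    occupied = len(labels_per_neuron)
    pure = sum(1 for labels in labels_per_neuron.values() if len(labels) == 1)
    occupancy = occupied / n_neurons
    purity = pure / occupied if occupied else 0.0
    return occupied, pure, occupancy, purity

# Toy assignments (hypothetical): neuron 2 mixes two countries, neuron 3 is empty.
toy = [(0, "USA"), (0, "USA"), (1, "India"), (2, "USA"), (2, "Brazil")]
occupied, pure, occupancy, purity = occupancy_and_purity(toy, n_neurons=4)
print(occupied, pure, occupancy, purity)
```

With the counts reported for this map (892 occupied cells out of 900, of which 839 are pure), the same two ratios give 99.11% and 94.06%.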

In the map, eight cells were empty, i.e. had no sequence mapped to them. The

rate of occupied cells was therefore 99.11% (892 of 900). Among the 892 occupied cells, 839 were

occupied by sequences from a single country, i.e. USA, India,

Russia or Brazil. This indicates that 94.06% of the occupied cells were pure for the

sequences of one country. This means that the decomposition of the 58,897

sequences into four subsets ($f(\mathbf{X}) \Rightarrow \bigcup \{\Omega_{USA}, \Omega_{India}, \Omega_{Brazil}, \Omega_{Russia}\}$)

was successful and that this set of sequences does have an intrinsic, significant

genome pattern that discriminates virus genomes from different

countries. Among the 839 pure neurons, 711 were pure for USA, 66 were

pure for India, 17 were pure for Russia and 45 were pure

for Brazil. Each of the 711 neurons only witnessed USA sequences, each of

the 66 neurons only witnessed India sequences, etc. Some neurons (or cells or

ve evidenced the overlap of sequences from different countries. This